Abstract. In this paper we consider the following problems: how many different subsets of $\Sigma^n$ can occur as the set of all length-$n$ factors of a finite word? If a subset is representable, how long a word do we need to represent it? How many such subsets are represented by words of length $t$? For the first problem, we give upper and lower bounds of the form $\alpha^{2^n}$ in the binary case. For the second problem, we give a weak upper bound and some experimental data. For the third problem, we give a closed-form formula in the case where $n \le t < 2n$. Algorithmic variants of these problems have previously been studied under the name "shortest common superstring".
1 Introduction

Let $w, x, y, z$ be finite words. If $w = xyz$, we say that $y$ is a factor of $w$. De Bruijn proved [1] the existence of a set of binary words $(B_n)_{n \ge 1}$ with the property that every binary word of length $n$ appears as a factor of $B_n$ (and, in fact, appears exactly once in $B_n$). Here we are thinking of $B_n$ interpreted as a circular word. For example, consider the case where $n = 2$, where we can take $B_2 = 0011$. Interpreted circularly, the factors of length 2 of $B_2$ are 00, 01, 11, 10, and these factors comprise all the binary words of length 2.

However, not every subset of $\{0,1\}^n$ can be represented as the factors of some finite word. For example, the set $\{00, 11\}$ cannot equal the set of all factors of any word $w$, whether interpreted in the ordinary sense or circularly, because the set of factors of any $w$ containing both letters must contain either 01 or 10.

This raises the natural question: how many different non-empty subsets $S$ of $\{0,1\}^n$ can be represented as the factors of some word $w$? (Note that, unlike [7], we do not insist that each element of $S$ appear exactly once in $w$.) We give upper and lower bounds for this quantity for circular words, both of the form $\alpha^{2^n}$. Our upper bound has $\alpha = \sqrt[4]{10} \doteq 1.78$ while our lower bound has $\alpha = \sqrt{2} \doteq 1.41$.

If the set of length-$n$ factors of a word $w$ (considered circularly) equals $S$, we say that $w$ witnesses $S$. We study the length of the shortest witness for subsets of $\{0,1\}^n$, and give an upper bound.

Restriction on the length of a witness leads us to another interesting problem. Let $T(t,n)$ denote the number of subsets of $\{0,1\}^n$ witnessed by some word of length $t \ge n$. Is there any characterization of $T(t,n)$? We focus on ordinary (non-circular) words for this question and derive a closed-form formula for $T(t,n)$ in the case where $n \le t < 2n$.

Algorithmic versions of related problems have been widely studied in the literature under the name "shortest common superstring". For example, Gallant, Maier, and Storer [4] proved that the following decision problem is NP-complete:

Instance: A set $S$ of words and an integer $K$.

Question: Is there a word $w$ of length $\le K$ containing each word in $S$ (and possibly others) as a factor?

However, the combinatorial problems that we study in this paper seem to be 
new. 



2 Preliminaries 

Let $\Sigma = \{0,1\}$ denote the alphabet. Let $F_n(w)$ denote the set of length-$n$ factors of an ordinary (non-circular) word $w$, and let $C_n(w)$ denote the set of length-$n$ factors of $w$ where $w$ is interpreted circularly. For example, if $w = 001$, then $F_2(w) = \{00, 01\}$, while if $w = 001$ is interpreted circularly, then $C_2(w) = \{00, 01, 10\}$.
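The two factor sets can be computed directly from their definitions. The following Python sketch is only an illustration: the function names `factors` and `circular_factors` are hypothetical (they are not notation from this paper), and words are represented as Python strings.

```python
def factors(w, n):
    """Return F_n(w): the set of length-n factors of the ordinary word w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def circular_factors(w, n):
    """Return C_n(w): the set of length-n factors of w read circularly."""
    # Read n letters from each of the |w| starting positions, wrapping around.
    return {"".join(w[(i + j) % len(w)] for j in range(n))
            for i in range(len(w))}


print(sorted(factors("001", 2)))
print(sorted(circular_factors("001", 2)))
print(sorted(circular_factors("0011", 2)))
```

The last call uses the circular word $B_2 = 0011$ from the introduction, whose circular factors of length 2 are all four binary words of length 2.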

We say that a word $w$ witnesses (resp., circularly witnesses) a subset $S$ of $\Sigma^n$ if $F_n(w) = S$ (resp., $C_n(w) = S$). A subset $S$ of $\Sigma^n$ is representable (resp., circularly representable) if there exists a non-empty word (resp., circular word) that witnesses $S$. Let $R_n$ denote the set of all non-empty representable subsets of $\Sigma^n$, and let $\widetilde{R}_n$ denote the set of all non-empty circularly representable subsets of $\Sigma^n$.

Let sw($S$) (resp., scw($S$)) denote the length of the shortest non-circular witness (resp., circular witness) for $S$. Let $\mu_n$ (resp., $\nu_n$) denote the maximum length of the shortest non-circular (resp., circular) witness over all representable (resp., circularly representable) subsets of $\Sigma^n$.

A de Bruijn word $B_n$ of order $n$ over the alphabet $\Sigma$ is a shortest circular witness for the set $\Sigma^n$. It is known [1] that the length of a de Bruijn word of order $n$ over $\Sigma$ is $2^n$.

For convenience, we let $w[i]$ denote the $i$'th letter of $w$ and $w[i..j]$ denote the factor of $w$ with length $j - i + 1$ that starts with the $i$'th letter of $w$. Thus $w = w[1..n]$ where $n = |w|$.

3 Bounds on the size of $\widetilde{R}_n$

In this section, we give lower and upper bounds on the size of $\widetilde{R}_n$, the set of circularly representable subsets of $\Sigma^n$, both of which are of the form $\alpha^{2^n}$. Our lower bound has $\alpha = \sqrt{2}$ while our upper bound has $\alpha = \sqrt[4]{10}$. Note that our lower bound also works for the size of $R_n$, since every circularly representable subset is also representable.
3.1 Lower bound

Our argument for the lower bound derives from constructing a set of circularly representable subsets.

Proposition 1. Let $b_n$ be any de Bruijn word of order $n$. Then $|C_{n+1}(b_n)| = 2^n$.

Proof. Every de Bruijn word of order $n$ is of length $2^n$; thus there are $2^n$ length-$(n+1)$ factors of $b_n$ (considered circularly), one for each starting position. These length-$(n+1)$ factors are pairwise distinct, for if $u \in \Sigma^{n+1}$ appears more than once as a factor of $b_n$, then $u[1..n]$ appears more than once as a factor of $b_n$. However, every length-$n$ factor appears only once in $b_n$, a contradiction. Hence $|C_{n+1}(b_n)| = 2^n$. □

Lemma 2. Given a de Bruijn word $b_n$, let $Y$ denote the set $\Sigma^{n+1} \setminus C_{n+1}(b_n)$. For any $y \in Y$, the set $\{y\} \cup C_{n+1}(b_n)$ is circularly witnessed by a word $w$ for which both the length-$2^n$ prefix and the length-$2^n$ suffix equal $b_n$.

Proof. We construct such a witness for $\{y\} \cup C_{n+1}(b_n)$.

Let $t = b_n b_n b_n b_n$. Let $y_1 = y[1..n]$ and $y_2 = y[2..n+1]$. Let $i_1$ denote the index of the first occurrence of $y_1$ in $t$; namely, the index $i_1$ is the minimal integer such that $y_1 = t[i_1..i_1+n-1]$. Let $i_2$ denote the index of the last occurrence of $y_2$ in $t$; namely, the index $i_2$ is the maximal integer such that $y_2 = t[i_2..i_2+n-1]$.

We argue that the first occurrence of $y_1$ does not overlap the last occurrence of $y_2$. We have $i_1 \le 2^n$, since every possible factor of length $n$ appears in the circular word $b_n$. Similarly, we obtain $i_2 > 3 \cdot 2^n - n$. Thus we have

$$i_1 + n - 1 - i_2 < -2 \cdot 2^n + 2n - 1 < 0,$$

and hence the first occurrence of $y_1$ does not overlap the last occurrence of $y_2$. Now consider the circular word

$$t_y = b_n b_n \; t[1..i_1-1] \; t[i_1..i_1+n-1] \; t[i_2+n-1] \; t[i_2+n..2^{n+2}] \; b_n b_n.$$

We argue that $t_y$ is a witness for $\{y\} \cup C_{n+1}(b_n)$. For one direction, every element of $\{y\} \cup C_{n+1}(b_n)$ appears as a length-$(n+1)$ factor of $t_y$. This is a consequence of the following two facts:

1. $b_n b_n$ witnesses $C_{n+1}(b_n)$.

2. $t[i_1..i_1+n-1] \, t[i_2+n-1] = y[1..n] \, y[n+1] = y$.

For the other direction, we can see that all factors of length $n+1$ in $t_y$ are elements of $\{y\} \cup C_{n+1}(b_n)$ by inspection. Note that the length-$2^n$ prefix and the length-$2^n$ suffix of $t_y$ both equal $b_n$. Hence we conclude that there exists a word for which the prefix and the suffix equal $b_n$ and this circular word circularly witnesses $\{y\} \cup C_{n+1}(b_n)$. □

Example 3. Let $n = 2$. One of the de Bruijn words of order 2 is $b_2 = 0011$. We have $C_3(b_2) = \{001, 011, 110, 100\}$. Thus $Y = \{000, 010, 101, 111\}$. Let $y = 010$. Here $t = 0011001100110011$, $i_1 = 2$ and $i_2 = 12$, and the following circular word demonstrates that the set $\{y\} \cup C_{n+1}(b_n)$ is representable:

$$t_{010} = (00110011)(0)(01)(0)(011)(00110011),$$

where the six blocks are, in order, $b_2 b_2$, $t[1..i_1-1]$, $t[i_1..i_1+n-1] = y_1$, $t[i_2+n-1]$, $t[i_2+n..2^{n+2}]$, and $b_2 b_2$.
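The construction in the proof of Lemma 2 is easy to check mechanically for small cases. The sketch below rebuilds $t_y$ for $b_2 = 0011$ and every $y \in Y$, and asserts that its circular factors of length $n + 1$ are exactly $\{y\} \cup C_{n+1}(b_n)$. The helper names `circular_factors` and `lemma2_witness` are hypothetical, and the 1-based indices of the proof are translated into 0-based Python slices; it is a sanity check for one small de Bruijn word, not a proof.

```python
from itertools import product


def circular_factors(w, n):
    """C_n(w): length-n factors of w read circularly."""
    return {"".join(w[(i + j) % len(w)] for j in range(n))
            for i in range(len(w))}


def lemma2_witness(b, n, y):
    """Build t_y as in the proof of Lemma 2, for a de Bruijn word b of order n."""
    t = b * 4
    y1, y2 = y[:n], y[1:]
    i1 = t.find(y1) + 1      # 1-based index of the first occurrence of y_1
    i2 = t.rfind(y2) + 1     # 1-based index of the last occurrence of y_2
    # t[1..i1-1] t[i1..i1+n-1] t[i2+n-1] t[i2+n..2^(n+2)] as 0-based slices.
    middle = t[:i1 - 1] + t[i1 - 1:i1 - 1 + n] + t[i2 + n - 2] + t[i2 + n - 1:]
    return b * 2 + middle + b * 2


n, b = 2, "0011"
base = circular_factors(b, n + 1)
Y = {"".join(p) for p in product("01", repeat=n + 1)} - base
for y in sorted(Y):
    t_y = lemma2_witness(b, n, y)
    assert circular_factors(t_y, n + 1) == base | {y}
    assert t_y.startswith(b) and t_y.endswith(b)
    print(y, t_y)
```

For $y = 010$ the word produced is the one displayed in Example 3.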

Proposition 4. Given a de Bruijn word $b_n$, let $Y$ denote the set $\Sigma^{n+1} \setminus C_{n+1}(b_n)$. For any subset $S \subseteq Y$, the set $S \cup C_{n+1}(b_n)$ is a circularly representable subset of $\Sigma^{n+1}$.

Proof. We have proved this proposition for the case where $|S| = 1$ by Lemma 2, and the case $S = \emptyset$ is witnessed by $b_n$ itself. Now we turn to the general case. Let $S = \{s_1, s_2, \ldots, s_m\}$. By Lemma 2, for each $1 \le i \le m$, there exists a circular word $t_i$ that witnesses $\{s_i\} \cup C_{n+1}(b_n)$ and both the prefix and the suffix of $t_i$ equal $b_n$. We argue that the circular word $t_S = t_1 t_2 \cdots t_m$ witnesses $S \cup C_{n+1}(b_n)$.

First, for any $1 \le i \le m$, $s_i$ appears in $t_i$ and thus in $t_S$. Moreover, every element of $C_{n+1}(b_n)$ appears in the prefix $b_n b_n$ of $t_S$. Thus, it suffices to show that every length-$(n+1)$ factor of $t_S$ is a member of $S \cup C_{n+1}(b_n)$. This is shown by the fact that for any $1 \le i < m$, both the suffix of $t_i$ and the prefix of $t_{i+1}$ equal $b_n$, which implies that the concatenation of $t_i$ and $t_{i+1}$ does not produce any new factor of length $n+1$ in $t_S$.

Thus, we conclude that for any subset $S$ of $Y$, there exists a witness for the set $S \cup C_{n+1}(b_n)$. □

Corollary 5. A lower bound for the size of $\widetilde{R}_{n+1}$ is $2^{2^n} = \sqrt{2}^{\,2^{n+1}}$.

Indeed, $|Y| = 2^{n+1} - 2^n = 2^n$ by Proposition 1, so Proposition 4 yields $2^{2^n}$ distinct circularly representable subsets of $\Sigma^{n+1}$.

3.2 Upper bound 

An obvious upper bound for $|\widetilde{R}_n|$ is $2^{2^n}$, since $\widetilde{R}_n \subseteq 2^{\Sigma^n}$, where $|2^{\Sigma^n}| = 2^{2^n}$. In this section, we will show that a tighter upper bound is $\alpha^{2^n}$, where $\alpha = \sqrt[4]{10}$.

Definition 6. Let $S \subseteq \Sigma^{n+1}$ and $T \subseteq \Sigma^n$. We say that $S$ is incident on $T$ if there exists a circular word $w$ such that $w$ witnesses both $S$ and $T$.

Example 7. Fix $n = 3$. Let $w = 0110$. Then $w$ is a witness for the sets $S = \{0110, 1100, 1001, 0011\} \in \widetilde{R}_4$ and $T = \{011, 110, 100, 001\} \in \widetilde{R}_3$. It follows that $S$ is incident on $T$. Note that $w' = 01100110$ is also a witness for $S$, and a witness for $T$ as well.

In fact we can argue that if $S$ is incident on $T$, then every word that witnesses $S$ also witnesses $T$.

Proposition 8. Every set $S \in \widetilde{R}_{n+1}$ is incident on exactly one set in $\widetilde{R}_n$.

Proof. Let $T = \{t \in \Sigma^n : \exists w \in S$ such that $t$ is a length-$n$ prefix or suffix of $w\}$. Then a word $w$ which witnesses $S$ also witnesses $T$. Thus $S$ is incident on $T$. Moreover, if $S$ is incident on $T$ and $T'$, then every witness of $S$ must also witness $T$ and $T'$. Thus we have $T = T'$. So we conclude that every set $S \in \widetilde{R}_{n+1}$ is incident on exactly one set in $\widetilde{R}_n$. □
Now we give a partition of $\widetilde{R}_{n+1}$. Let

$$\widetilde{R}_{n+1}[T] = \{S \in \widetilde{R}_{n+1} : S \text{ is incident on } T\}.$$

Proposition 8 implies that the sets $\widetilde{R}_{n+1}[T]$, $T \subseteq \Sigma^n$, are pairwise disjoint and cover $\widetilde{R}_{n+1}$. Namely, (1) for every $T_1 \ne T_2$, we have $\widetilde{R}_{n+1}[T_1] \cap \widetilde{R}_{n+1}[T_2] = \emptyset$ and (2) $\bigcup_{T \subseteq \Sigma^n} \widetilde{R}_{n+1}[T] = \widetilde{R}_{n+1}$.

Thus we have $|\widetilde{R}_{n+1}| = \sum_{T \subseteq \Sigma^n} |\widetilde{R}_{n+1}[T]|$. In order to get an upper bound for $|\widetilde{R}_{n+1}|$, it suffices to give an upper bound for the size of $\widetilde{R}_{n+1}[T]$.

Definition 9. Let $x$ be a word of length $n$. We say that $P_x = \{0x, 1x\}$ is a pair of order $n$ w.r.t. $x$, that $S_x = \{0x, 1x, x0, x1\}$ is a skeleton of order $n$ w.r.t. $x$, and that $N_x = \{0x0, 0x1, 1x0, 1x1\}$ is a net of order $n$ w.r.t. $x$. We also say that a set $S$ contains $P_x$ (resp., $S_x$ and $N_x$) if $P_x \subseteq S$ (resp., $S_x \subseteq S$ and $N_x \subseteq S$).

For any $T \subseteq \Sigma^n$, let $\sigma(T)$ denote the number of skeletons of order $n-1$ in $T$ and let $p(T)$ denote the number of pairs of order $n-1$ in $T$. We have the following proposition:

Proposition 10. For any $T \subseteq \Sigma^n$, we have $|\widetilde{R}_{n+1}[T]| \le 7^{\sigma(T)}$.

Before giving the proof of Proposition 10, we introduce another definition.

Definition 11. A set $R$ is feasible for a set $T \subseteq \Sigma^n$ if there exists $S \in \widetilde{R}_{n+1}[T]$ such that $R \subseteq S$.

We observe that $\Sigma^{n+1} = \bigcup_{x \in \Sigma^{n-1}} N_x$, and thus any subset $S \subseteq \Sigma^{n+1}$ is a disjoint union of subsets of nets of order $n-1$. Formally, for any subset $S \subseteq \Sigma^{n+1}$, we have $S = \bigcup_{x \in \Sigma^{n-1}} R_x$, where $R_x \subseteq N_x$.

Proof (of Proposition 10). Let $F_x$ denote the set of feasible subsets (for $T$) of the net $N_x$. If $S \in \widetilde{R}_{n+1}[T]$, then $S$ is a disjoint union of feasible subsets (for $T$) of nets. Thus we have $|\widetilde{R}_{n+1}[T]| \le \prod_{x \in \Sigma^{n-1}} |F_x|$. In order to prove this proposition, it now suffices to show that for any $x \in \Sigma^{n-1}$, the following condition holds.

- if $S_x \subseteq T$, then $|F_x| \le 7$;

- otherwise $|F_x| \le 1$.

For any $x \in \Sigma^{n-1}$, we consider all the possible feasible subsets of $N_x$. Let $F$ denote any feasible subset of $N_x$.

- For the first case where $S_x \subseteq T$, we have the following properties:

1. Either $0x0 \in F$ or $0x1 \in F$ since $0x \in T$;

2. Either $1x0 \in F$ or $1x1 \in F$ since $1x \in T$;

3. Either $0x0 \in F$ or $1x0 \in F$ since $x0 \in T$;

4. Either $0x1 \in F$ or $1x1 \in F$ since $x1 \in T$.

Hence we have at most 7 possible feasible subsets of $N_x$, which are listed as follows: $\{0x0, 1x1\}$, $\{0x0, 0x1, 1x1\}$, $\{0x0, 1x0, 1x1\}$, $\{0x0, 0x1, 1x0, 1x1\}$, $\{0x0, 0x1, 1x0\}$, $\{0x1, 1x0\}$, $\{0x1, 1x0, 1x1\}$. Thus $|F_x| \le 7$.

- For the second case where $S_x \not\subseteq T$, we argue that $|F_x| \le 1$. Without loss of generality, suppose $0x \notin T$. It follows that:

1. $0x0$ and $0x1$ cannot occur in $F$ since $0x \notin T$;

2. $1x0 \in F$ if and only if $x0 \in T$;

3. $1x1 \in F$ if and only if $x1 \in T$.

Hence, $F$ is fixed. It follows that $|F_x| \le 1$.

By finishing the argument on the above two cases, we conclude that $|\widetilde{R}_{n+1}[T]| \le 7^{\sigma(T)}$. □
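The count of 7 in the first case of the proof can be reproduced by enumerating the 16 subsets of a net and keeping those that satisfy the four listed conditions. In the sketch below the word $x$ is kept symbolic as the character `x`, and the names `NET` and `covers_skeleton` are hypothetical helpers introduced only for this illustration.

```python
from itertools import combinations

NET = ("0x0", "0x1", "1x0", "1x1")   # the net N_x, with x kept symbolic


def covers_skeleton(F):
    """True if each of 0x, 1x, x0, x1 is a prefix or a suffix of some word in F.

    This encodes conditions 1-4 of the first case of the proof.
    """
    need = {"0x", "1x", "x0", "x1"}
    have = {w[:2] for w in F} | {w[1:] for w in F}
    return need <= have


candidates = [set(c) for r in range(len(NET) + 1)
              for c in combinations(NET, r)]
feasible = [F for F in candidates if covers_skeleton(F)]

print(len(feasible))
for F in feasible:
    print(sorted(F))
```

The enumeration finds exactly the seven subsets listed in the proof: the two "diagonal" pairs, the four subsets of size three, and the full net.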

Now, we are close to the core part. Instead of computing the number of 
skeletons, which is quite complex, we consider the number of pairs. 

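Proposition 12. The size of the set $\widetilde{R}_{n+1}$ is bounded by $10^{2^{n-1}}$.

Proof. Let $L_{k,i}$ denote the number of subsets $T \in \widetilde{R}_n$ such that $|T| = k$ and $p(T) = i$. There are in total $2^{n-1}$ pairs in $\Sigma^n$, and we first choose $i$ pairs from them. Then, we choose the other $k - 2i$ elements, which do not form any pair, by selecting $k - 2i$ of the remaining $2^{n-1} - i$ pairs and taking one element from each. Thus, we have

$$L_{k,i} \le \binom{2^{n-1}}{i} \binom{2^{n-1}-i}{k-2i} 2^{k-2i}.$$

Note that $k \ge 2i$ since a set of $k$ elements can contain at most $k/2$ pairs, and the term $L_{k,i}$ vanishes when $k - 2i > 2^{n-1} - i$. Thus we have

$$|\widetilde{R}_{n+1}| = \sum_{T \subseteq \Sigma^n} |\widetilde{R}_{n+1}[T]| \le \sum_{k=0}^{2^n} \sum_{i=0}^{\lfloor k/2 \rfloor} L_{k,i} \, 7^i.$$

The inequality holds since we count the number of pairs instead of the number of skeletons, and the number of pairs is always greater than or equal to the number of skeletons. Then we can see that

$$|\widetilde{R}_{n+1}| \le \sum_{k=0}^{2^n} \sum_{i=0}^{\lfloor k/2 \rfloor} \binom{2^{n-1}}{i} \binom{2^{n-1}-i}{k-2i} 2^{k-2i} \, 7^i = \sum_{i=0}^{2^{n-1}} \binom{2^{n-1}}{i} 7^i \sum_{k=2i}^{2^n} \binom{2^{n-1}-i}{k-2i} 2^{k-2i}$$

by substituting the bound on $L_{k,i}$ and exchanging the order of summation. Note that

$$\sum_{k=2i}^{2^n} \binom{2^{n-1}-i}{k-2i} 2^{k-2i} = \sum_{k=0}^{2^n-2i} \binom{2^{n-1}-i}{k} 2^k = \sum_{k=0}^{2^{n-1}-i} \binom{2^{n-1}-i}{k} 2^k = 3^{2^{n-1}-i}.$$

So we have

$$|\widetilde{R}_{n+1}| \le \sum_{i=0}^{2^{n-1}} \binom{2^{n-1}}{i} 7^i \, 3^{2^{n-1}-i} = 10^{2^{n-1}}.$$ □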
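The final chain of equalities can be checked numerically for small $n$. The sketch below evaluates the double sum $\sum_k \sum_i \binom{m}{i}\binom{m-i}{k-2i} 2^{k-2i} 7^i$ with $m = 2^{n-1}$ using exact integer arithmetic and compares it with $10^{2^{n-1}}$. The function name `pair_bound` is hypothetical; the check only confirms the arithmetic of the last steps, not the combinatorial argument that precedes them.

```python
from math import comb


def pair_bound(n):
    """Evaluate sum_k sum_i C(m,i) C(m-i,k-2i) 2^(k-2i) 7^i with m = 2^(n-1)."""
    m = 2 ** (n - 1)
    total = 0
    for k in range(2 * m + 1):            # k = 0, ..., 2^n
        for i in range(k // 2 + 1):       # i = 0, ..., floor(k/2)
            total += (comb(m, i) * comb(m - i, k - 2 * i)
                      * 2 ** (k - 2 * i) * 7 ** i)
    return total


for n in range(1, 5):
    assert pair_bound(n) == 10 ** (2 ** (n - 1))
    print(n, pair_bound(n))
```

Note that `math.comb(a, b)` returns 0 when `b > a`, which matches the remark that the corresponding terms vanish.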



Proposition 12 directly implies the upper bound we claimed in the beginning of this section, since $|\widetilde{R}_n| \le 10^{2^{n-2}} = (\sqrt[4]{10})^{2^n}$.
4 Shortest witness

Recall that $\mu_n$ (resp., $\nu_n$) is the maximum length of the shortest non-circular witness (resp., circular witness) over all representable (resp., circularly representable) subsets of $\Sigma^n$. The quantities $\mu_n$ and $\nu_n$ are of interest since we can enumerate all sequences of length less than or equal to $\mu_n$ (resp., $\nu_n$) in order to list all the representable (resp., circularly representable) subsets of $\Sigma^n$. In this section we obtain an upper bound on $\mu_n$ and $\nu_n$.

We need the following result of Hamidoune [5, Prop. 2.1]. Since the result is little-known and has apparently not appeared in English, we give the proof here. By a Hamiltonian walk we mean a closed walk, possibly repeating vertices and edges, that visits every vertex of $G$.

Proposition 13. Let $G = (V, E)$ be a directed graph on $n$ vertices. If $G$ is strongly connected (that is, if there is a directed path from every vertex to every vertex), then there is a Hamiltonian walk of length at most $\lfloor (n+1)^2/4 \rfloor$. Furthermore, this bound is best possible.

Proof. Let $L$ be a longest simple path in $G$. (A simple path does not repeat edges or vertices.) Let $V - L = \{v_i : 1 \le i \le k\}$. Let $v_0$ be the last vertex in $L$ and $v_{k+1}$ be the first vertex in $L$. Let $L_i$ be a simple path from $v_i$ to $v_{i+1}$. Then a Hamiltonian walk $W$ is obtained by following the edges in $L_0, L_1, \ldots, L_k$, and then those in $L$. So the number of edges in $W$ is at most $(k+2)|L| = |L|(n+1-|L|)$, where $|L|$ denotes the number of edges of $L$. But it is easy to see that $r(n+1-r)$ is maximized over the integers when $r = \lceil n/2 \rceil$, where it equals $\lfloor (n+1)^2/4 \rfloor$, as claimed.

To see that this bound is best possible, consider a graph where there is a directed chain of $\lfloor n/2 \rfloor$ vertices, where the last vertex has a directed edge to $\lceil n/2 \rceil$ other vertices, and each of those vertices has a single directed edge back to the start of the chain. The shortest walk covering all the vertices traverses the chain, then an edge to one of the other vertices, then a single edge back, and repeats this $\lceil n/2 \rceil$ times. The total length is then $(\lfloor n/2 \rfloor + 1) \lceil n/2 \rceil = \lfloor (n+1)^2/4 \rfloor$. So the bound is tight. □
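The tightness claim can be confirmed by brute force for small $n$. The sketch below builds the extremal graph described above and computes the length of a shortest closed walk visiting every vertex, using a breadth-first search over states (current vertex, set of visited vertices). The names `extremal_graph` and `shortest_hamiltonian_walk` are hypothetical, vertices are numbered from 0 with vertex 0 as the start of the chain, and the search is exponential in $n$, so it is meant only for very small graphs.

```python
from collections import deque


def extremal_graph(n):
    """Chain of floor(n/2) vertices whose last vertex points to the other
    ceil(n/2) vertices, each of which points back to the start of the chain."""
    a = n // 2
    adj = {v: [] for v in range(n)}
    for v in range(a - 1):
        adj[v].append(v + 1)
    for o in range(a, n):
        adj[a - 1].append(o)
        adj[o].append(0)
    return adj


def shortest_hamiltonian_walk(adj, start=0):
    """Length of a shortest closed walk from `start` visiting every vertex."""
    full = (1 << len(adj)) - 1
    first = (start, 1 << start)
    dist = {first: 0}
    queue = deque([first])
    while queue:
        v, seen = queue.popleft()
        for u in adj[v]:
            state = (u, seen | (1 << u))
            if state not in dist:
                dist[state] = dist[(v, seen)] + 1
                if state == (start, full):
                    return dist[state]
                queue.append(state)
    return None   # no closed covering walk of positive length was found


for n in range(2, 9):
    assert shortest_hamiltonian_walk(extremal_graph(n)) == (n + 1) ** 2 // 4
```

Starting the search at vertex 0 loses no generality here, since any closed walk covering all vertices passes through vertex 0 and can be rotated to start there.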

From this we immediately get

Proposition 14. An upper bound for $\mu_n$ and $\nu_n$ is $2^{2n-2} + 2^{n-1}$.

5 Numerical results 

It is not feasible to enumerate every single word to verify whether a subset is 
circularly representable (or non-circularly representable). For this reason, we 
exploit ideas from graph theory. 

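Formally, we define $G_n = (V_n, E_n)$, where

$V_n = \{(S, u, v) : S \subseteq \Sigma^n \text{ and } u, v \in \Sigma^n\}$ and

$E_n = \{((S, u, v), (S \cup \{x\}, u, x)) : S \subseteq \Sigma^n,\ u, v, x \in \Sigma^n, \text{ and } v[2..n] = x[1..n-1]\}$.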



VIII Shuo Tan and Jeffrey Shallit 

We say that a node $(S, u, v)$ is valid if $S$ is witnessed by a non-circular word $w$ for which the length-$n$ prefix is $u$ and the length-$n$ suffix is $v$.

We use a breadth-first search strategy to compute all the possible valid nodes in $G_n$. Let $I$ denote the subset of nodes $\{(\{u\}, u, u) : u \in \Sigma^n\}$ in $G_n$. Nodes in $G_n$ that are reachable from some node in $I$ can be proven valid by induction. Thus, a breadth-first search begins with the subset $I$ and enumerates all nodes that are reachable from nodes in $I$.

The relation between valid nodes in $G_n$ and non-empty representable subsets of order $n$ is that any subset $S \subseteq \Sigma^n$ is representable if and only if there exist $u, v \in \Sigma^n$ such that $(S, u, v)$ is valid. This relation can be proved by induction. Similarly, any subset $S \subseteq \Sigma^n$ is circularly representable if and only if there exists $u \in \Sigma^n$ such that $(S, u, u)$ is valid and the minimum distance $d$ between $(S, u, u)$ and nodes in $I$ satisfies the inequality $d > n - 1$.
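The search for representable subsets can be sketched as follows. This is a minimal illustration of the breadth-first search over $G_n$ for the non-circular case only, under the assumption that nodes are stored as Python tuples `(frozenset, str, str)`; the function names `representable_subsets` and `brute_force` are hypothetical, and the brute-force routine is included only as a cross-check for tiny $n$.

```python
from collections import deque
from itertools import product


def representable_subsets(n, alphabet="01"):
    """BFS over G_n from the nodes ({u}, u, u); returns every reached set S."""
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    start = [(frozenset([u]), u, u) for u in words]
    seen = set(start)
    queue = deque(start)
    while queue:
        S, u, v = queue.popleft()
        for a in alphabet:
            x = v[1:] + a                  # every x with v[2..n] = x[1..n-1]
            node = (S | {x}, u, x)
            if node not in seen:
                seen.add(node)
                queue.append(node)
    return {S for S, _, _ in seen}


def brute_force(n, max_len, alphabet="01"):
    """Factor sets F_n(w) of all words w with n <= |w| <= max_len."""
    found = set()
    for t in range(n, max_len + 1):
        for p in product(alphabet, repeat=t):
            w = "".join(p)
            found.add(frozenset(w[i:i + n] for i in range(t - n + 1)))
    return found


# Every factor set found by exhaustive enumeration must be reached by the BFS.
assert brute_force(2, 6) <= representable_subsets(2)
print(len(representable_subsets(2)), len(representable_subsets(3)))
```

The number of nodes grows like $2^{2^n} \cdot 4^n$, so this direct version is practical only for small $n$.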

With the above properties, we can enumerate all the possible non-empty representable (or circularly representable) subsets of order $n$. Our results are shown in the following table. The last two columns give words $w$ of length $\nu_n$ (resp., $\mu_n$) for which no shorter word witnesses $C_n(w)$ (resp., $F_n(w)$).

[Table: for each $n$, the sizes of $R_n$ and $\widetilde{R}_n$, the values $\nu_n$ and $\mu_n$, and the corresponding witness words.]
6 Fixed-length witnesses

We now turn to a related question. We fix a length $n$ and we ask, how many different subsets of $\Sigma^n$ can we obtain by taking the (ordinary, non-circular) factors of a word of length $t$? We call this quantity $T(t,n)$. As we will see, for $t < 2n$, there is a relatively simple answer to this question.

In order to compute $T(t,n)$, we consider the number of words that witness the same subset of $\Sigma^n$. Suppose $S \subseteq \Sigma^n$. Let $C_t(S)$ denote the number of words of length $t$ that witness $S$. Then we have

$$T(t,n) = 2^t - \sum_{S \subseteq \Sigma^n,\; C_t(S) > 1} (C_t(S) - 1).$$

It suffices to characterize what subsets $S$ satisfy $C_t(S) > 1$ and to determine $C_t(S)$.

For $t < 2n$, we have such a characterization by Theorem 15 below. Before stating the theorem, we first introduce some notation.

Let $w$ be a word. Let Pref($w$) denote the set of prefixes of $w$. A period $p$ of $w$ is a positive integer such that $w$ can be factorized as

$$w = s^k s'$$

with $|s| = p$, $s' \in \mathrm{Pref}(s)$, and $k \ge 1$. Let $\pi(w)$ denote the minimal period of $w$.

The root of a word $w$ is the prefix of $w$ with length $\pi(w)$. Let $r(w)$ denote the root of $w$. Two words $w$ and $w'$ are conjugate if there exist $u, v \in \Sigma^*$ such that $w = uv$ and $w' = vu$; $w$ and $w'$ are root-conjugate if their roots $r(w)$ and $r(w')$ are conjugate.
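These notions translate into a few lines of code. The sketch below is an illustration for non-empty words given as Python strings; the function names are hypothetical. It uses the standard facts that $p$ is a period of $w$ exactly when $w[i] = w[i+p]$ for all valid $i$, and that $v$ is a conjugate of $u$ exactly when $|u| = |v|$ and $v$ occurs in $uu$.

```python
def minimal_period(w):
    """pi(w): the least p >= 1 such that w[i] == w[i + p] for all valid i."""
    return next(p for p in range(1, len(w) + 1)
                if all(w[i] == w[i + p] for i in range(len(w) - p)))


def root(w):
    """r(w): the prefix of w of length pi(w)."""
    return w[:minimal_period(w)]


def are_conjugate(u, v):
    """True if u = xy and v = yx for some words x, y."""
    return len(u) == len(v) and v in u + u


def are_root_conjugate(w1, w2):
    """True if the roots of w1 and w2 are conjugate."""
    return are_conjugate(root(w1), root(w2))


print(minimal_period("01001"), root("01001"))
print(are_root_conjugate("01001", "10010"))
```

For instance, $01001$ and $10010$ have roots $010$ and $100$, which are conjugate, so the two words are root-conjugate.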

The following theorem is crucial for our work and of independent interest. 

Theorem 15. Let $t, n, k$ be such that $t = n + k$, $n \ge k + 1$, and $k \ge 0$. Let $w$ and $w'$ be distinct words of length $t$ over an arbitrary alphabet. Then $F_n(w) = F_n(w')$ iff $\pi(w) = \pi(w') \le k + 1$ and $w, w'$ are root-conjugate.

One direction is easy: if $w$ and $w'$ are root-conjugate with period $p \le k + 1$, then there are $p$ places to begin, and considering consecutive factors of length $n + p - 1$ gives exactly $p$ distinct length-$n$ factors.

For the other direction, we need three lemmas. 

Lemma 16. (Fine-Wilf theorem [3, Theorem 1]) Let $w_1, w_2$ be two words. If $w_1$ and $w_2$ have a common prefix of length $\pi(w_1) + \pi(w_2) - 1$, then $r(w_1) = r(w_2)$.

Lemma 17. For any $w \in \Sigma^+$, if there exists a factorization $w = xyz$ such that $xy = yz$ and $x, y, z \in \Sigma^+$, then $w$ is periodic with $\pi(w) \le |x|$.

Proof. By the Lyndon-Schützenberger theorem [6, Lemma 2], there exist $u \in \Sigma^+$, $v \in \Sigma^*$ and an integer $e \ge 0$ such that $x = uv$, $y = (uv)^e u$, $z = vu$. Thus $w = (uv)^{e+2} u$. Thus $w$ is periodic with $\pi(w) \le |x|$. □

Lemma 18. Let $t, n, k$ be integers such that $t = n + k$, $n \ge k + 1$, and $k \ge 0$. Let $w$ be a word of length $t$ with $\pi(w) \le k + 1$. If $w'$ is any word such that $F_n(w) = F_n(w')$, then $w$ and $w'$ are root-conjugate.

Carpi and de Luca proved a stronger proposition [2, Proposition 6.2] which directly implies this lemma. We first introduce some relevant notation from that paper.

A factor $s$ of a word $w$ is said to be right-special in $w$ if there exist two distinct symbols $a$ and $b$ such that $sa$ and $sb$ are factors of $w$. Let $R_w$ denote the minimal length $m$ such that there exists no factor of length $m$ that is right-special.

A factor $s$ of a word $w$ is said to be right-extendable (resp., left-extendable) in $w$ if there exists a symbol $a$ such that $sa$ is a factor of $w$ (resp., $as$ is a factor of $w$). Let $K_w$ and $H_w$ denote the length of the shortest factor which is not right-extendable (resp., left-extendable).

A word $w$ is semiperiodic if $R_w < H_w$.

Proof (of Lemma 18). Carpi and de Luca proved [2, Lemma 3.2] that $\pi(w) \ge R_w + 1$. Also, we have $H_w \ge \pi(w)$ since the length-$(\pi(w) - 1)$ prefix of $w$ is left-extendable. Thus $w$ is semiperiodic. Moreover we have $F_n(w) = F_n(w')$ where $n \ge k + 1 \ge \pi(w) \ge 1 + R_w$. Then we can apply [2, Proposition 6.2] to prove this lemma. □
Proof (of Theorem 15). We give a proof for Theorem 15 by induction on $k$.

The base case is when $k = 0$. In this case $t = n$ and thus $F_n(w) = \{w\}$ and $F_n(w') = \{w'\}$. Thus $F_n(w) = F_n(w')$ implies $w = w'$.

Now we deal with the induction step. We assume the result holds for $k - 1$ and we prove it for $k$. For convenience, we let $p_i(w)$ denote the length-$i$ prefix of the word $w$; let $s_i(w)$ denote the length-$i$ suffix of the word $w$.

We first consider the case where $H_w < n$. We have $p_n(w) \in F_n(w) = F_n(w')$. If $p_n(w) \ne p_n(w')$, then there exists $a \in \Sigma$ such that $a\,p_{n-1}(w) \in F_n(w')$. Thus we have $a\,p_{n-1}(w) \in F_n(w)$, which leads to the contradiction that $H_w \ge |a\,p_{n-1}(w)| = n$. Hence $p_n(w) = p_n(w')$.

Now let $s = w[2..t]$ and $s' = w'[2..t]$. Clearly $|s| = |s'| = t - 1$. The prefix $p_n(w)$ appears only once as a factor of $w$, otherwise $p_{n-1}(w)$ is left-extendable in $w$, which contradicts the fact that $H_w < n$. Thus we have $F_n(s) = F_n(w) \setminus \{p_n(w)\}$. Similarly we have $F_n(s') = F_n(w') \setminus \{p_n(w)\}$. Thus $F_n(s) = F_n(s')$. Let $k' = k - 1$. We have $t - 1 = n + k - 1 = n + k'$ and $n \ge k + 1 > k' + 1$. By induction, we have either

Case 1: $s = s'$; or

Case 2: $s$ and $s'$ are root-conjugate and $\pi(s) = \pi(s') = p$, where $p \le k' + 1 = k$.

In Case 1, it follows that $w = w'$, contradicting the fact that $w, w'$ are distinct. In Case 2, we prove that $s = s'$ by showing that their roots are identical. Suppose $s$ and $s'$ have a common prefix of length $d$. We have $d \ge n - 1$, since $w$ and $w'$ have a common prefix of length at least $n$. If $d \ge p$, then the root of $s$ is identical to the root of $s'$. Otherwise, we have the chain of inequalities $k \ge p \ge d + 1 \ge n \ge k + 1$, which is trivially a contradiction. Thus neither Case 1 nor Case 2 can occur and we are done with the case where $H_w < n$.

Similarly we can prove the induction step when $K_w < n$. Thus it suffices to consider the case where $H_w \ge n$ and $K_w \ge n$. We first claim $\pi(w) \le k + 1$. There are several cases to settle:

- The first case is when $p_{n-1}(w) = s_{n-1}(w)$ and the occurrences of $p_{n-1}(w)$ and $s_{n-1}(w)$ do not overlap; namely we have $w = p_{n-1}(w) \, L \, p_{n-1}(w)$, where $L \in \Sigma^*$. We have the equality $n + k = t = |w| = 2|p_{n-1}(w)| + |L| = 2(n-1) + |L|$. Thus $|L| = k + 2 - n$. Hence $\pi(w) \le |p_{n-1}(w) L| = n - 1 + k + 2 - n = k + 1$.

- The second case is when $p_{n-1}(w) = s_{n-1}(w)$ and these occurrences overlap. Formally we put it as follows: there exist $x, y, z \in \Sigma^+$ such that $p_{n-1}(w) = xy = yz$ and $w = xyz$. It follows that $\pi(w) \le |x| \le k + 1$ by Lemma 17.

- The last case is when $p_{n-1}(w) \ne s_{n-1}(w)$. Let $i_p$ denote the index of the last occurrence of $p_{n-1}(w)$; namely $i_p = \sup\{i \ge 1 : p_{n-1}(w) = w[i..i+n-2]\}$. Note that $i_p > 1$ since $p_{n-1}(w)$ is left-extendable and $i_p < t - n + 2$ since $p_{n-1}(w) \ne s_{n-1}(w)$. Thus, the first occurrence of $p_{n-1}(w)$ (the prefix of $w$) overlaps the last occurrence of $p_{n-1}(w)$. By Lemma 17 we get that $w_1 = w[1..i_p+n-2]$ is periodic with $\pi(w_1) \le i_p - 1$. Similarly we let $i_q$ denote the index of the first occurrence of $s_{n-1}(w)$ and $w_2 = w[i_q..t]$. We have $1 < i_q < t - n + 2$ and $\pi(w_2) \le t - n + 2 - i_q$. The factors $w_1$ and $w_2$ overlap for at least $|w_1| + |w_2| - t \ge \pi(w_1) + \pi(w_2) - 1$ symbols. Let $D$ denote the overlap of $w_1$ and $w_2$. We have $|D| \ge \pi(w_1) + \pi(w_2) - 1$. Also $\pi(w_1)$ is a period of $D$ since $|D| \ge \pi(w_1)$ and $D$ can be factorized as

$$D = d^l d',$$

where $d$ is conjugate to the root of $w_1$, $d' \in \mathrm{Pref}(d)$, and $l \ge 1$. By Lemma 16 the overlap $D$ has the same root as $w_2$. Since root-conjugacy is an equivalence relation, we have that $w_1$ and $w_2$ are root-conjugate. It follows that $w$ is periodic with $\pi(w) = \pi(w_1) \le k + 1$.

Finally by Lemma 18 we get that $w$ and $w'$ are root-conjugate and their periods satisfy $\pi(w) = \pi(w') \le k + 1$. By all cases, we finish the induction and complete the proof of Theorem 15. □



The following corollary gives $T(t,n)$ when $t < 2n$.

Corollary 19. For $n \le t < 2n$, we have

$$T(t,n) = 2^t - \sum_{k=1}^{t-n+1} \frac{k-1}{k} \sum_{d \mid k} \mu(k/d) \, 2^d,$$

where $\mu(\cdot)$ is the Möbius function.

Proof. Let $k = t - n$. We have $n \ge t - n + 1 = k + 1$. By Theorem 15 we know that for any set $S \subseteq \Sigma^n$, $C_t(S) > 1$ if and only if there exists a word $w$ that witnesses $S$ with $\pi(w) \le k + 1$. In this case we have $C_t(S) = \pi(w)$; that is, the set of words that witness $S$ is the same as the set of the words that are root-conjugate to $w$. Thus each $S$ such that $C_t(S) > 1$ corresponds to a set of root-conjugate words, which can be represented by their lexicographically least roots (the Lyndon words).

Thus we have

$$T(t,n) = 2^t - \sum_{S \subseteq \Sigma^n,\; C_t(S) > 1} (C_t(S) - 1) = 2^t - \sum_{w \text{ is a Lyndon word},\; \pi(w) \le k+1} (\pi(w) - 1) = 2^t - \sum_{i=1}^{k+1} (i - 1) \, L(i),$$

where $k = t - n$ and $L(i) = \frac{1}{i} \sum_{d \mid i} \mu(i/d) \, 2^d$ is the number of Lyndon words of length $i$. □
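The closed form is straightforward to evaluate and to compare with an exhaustive count. The sketch below implements $L(i)$ through the Möbius function and checks the formula against brute force for all $n \le 6$ and $n \le t < 2n$ over the binary alphabet. The function names (`mobius`, `lyndon_count`, `T_formula`, `T_brute`) are hypothetical, and the brute-force count enumerates all $2^t$ words, so it is suitable only for small $t$.

```python
from itertools import product


def mobius(m):
    """Moebius function, computed by trial division."""
    result, p = 1, 2
    while p * p <= m:
        if m % p == 0:
            m //= p
            if m % p == 0:        # a squared prime factor
                return 0
            result = -result
        p += 1
    return -result if m > 1 else result


def lyndon_count(i):
    """L(i): the number of binary Lyndon words of length i."""
    return sum(mobius(i // d) * 2 ** d
               for d in range(1, i + 1) if i % d == 0) // i


def T_formula(t, n):
    """Closed form of Corollary 19; valid only for n <= t < 2n."""
    assert n <= t < 2 * n
    return 2 ** t - sum((i - 1) * lyndon_count(i) for i in range(1, t - n + 2))


def T_brute(t, n):
    """Number of distinct sets F_n(w) over all binary words w of length t."""
    sets = set()
    for p in product("01", repeat=t):
        w = "".join(p)
        sets.add(frozenset(w[i:i + n] for i in range(t - n + 1)))
    return len(sets)


for n in range(1, 7):
    for t in range(n, 2 * n):
        assert T_formula(t, n) == T_brute(t, n)
```

Since `T_brute` works for any $t \ge n$, it can also be used to explore values outside the range covered by the corollary.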



Example 20. To finish this section, we give a table listing some numerical results for $T(t,n)$. The entries with $n \le t < 2n$ follow from Corollary 19. A dash marks the cases $t < n$.

n \ t      1      2      3      4      5      6      7      8      9     10     11     12     13     14     15     16
    1      2      3      3      3      3      3      3      3      3      3      3      3      3      3      3      3
    2      -      4      7     11     12     12     12     12     12     12     12     12     12     12     12     12
    3      -      -      8     15     27     48     72     94    100    103    101    103    101    103    101    103
    4      -      -      -     16     31     59    114    216    391    677   1087   1621   2246   2928   3595   4235
    5      -      -      -      -     32     63    123    242    474    933   1795   3421   6399  11682  20704  35914
    6      -      -      -      -      -     64    127    251    498    986   1965   3899   7709  15171  29710  57726
    7      -      -      -      -      -      -    128    255    507   1010   2010   4013   8001  15969  31789  63256
    8      -      -      -      -      -      -      -    256    511   1019   2034   4058   8109  16193  32367  64671


7 Open Problems and Future Work 

1. In Section 3 we gave lower and upper bounds on $|\widetilde{R}_n|$, both of the form $\alpha^{2^n}$. Does the limit $\lim_{n \to \infty} |\widetilde{R}_n|^{2^{-n}}$ exist?

2. Find better bounds for $\mu_n$ and $\nu_n$. For example, is $\mu_n < (n-1)2^n$ for $n > 2$?

3. It is easy to see that Theorem 15 fails for $n < k + 1$. Indeed, it is possible to have $F_n(x) = F_n(y)$ in this case, and yet $\pi(x) \ne \pi(y)$. For example, take $n = k - 1$ so that $t = 2k - 1$, and consider $x = 0^k 1 0^{k-2}$ and $y = 0^{k-1} 1 0^{k-1}$. Then $F_n(x) = F_n(y)$ but $\pi(x) = k + 1$ and $\pi(y) = k$.

The remaining case is $n = k$, so that $t = 2k$. We conjecture that if $x$ and $y$ are distinct binary words of length $2n$ with $F_n(x) = F_n(y)$ then $\pi(x) = \pi(y)$ and furthermore $x$ and $y$ are root-conjugate. However, it is possible in this case that $\pi(x) > n + 1$. Furthermore it seems that if $\pi(x) > n + 1$, then $x = u v 01 v^R u$ and $y = u v 10 v^R u$ (or vice versa) for some nonempty words $u, v$ where $u$ is a palindrome and $\pi(x) = n + |u|$.

As an example, consider $x = 010110$, $y = 011010$. Then $F_3(x) = F_3(y) = \{010, 011, 101, 110\}$ but $\pi(x) = \pi(y) = 5$. Here $u = 0$, $v = 1$.
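Both examples in item 3 can be verified in a few lines. The sketch below checks the family $x = 0^k 1 0^{k-2}$, $y = 0^{k-1} 1 0^{k-1}$ for several values of $k$, and then the pair $x = 010110$, $y = 011010$. The helper names `factors` and `minimal_period` are hypothetical, and the range of $k$ tested is an arbitrary choice.

```python
def factors(w, n):
    """F_n(w): the set of length-n factors of w."""
    return {w[i:i + n] for i in range(len(w) - n + 1)}


def minimal_period(w):
    """pi(w): the least p >= 1 such that w[i] == w[i + p] for all valid i."""
    return next(p for p in range(1, len(w) + 1)
                if all(w[i] == w[i + p] for i in range(len(w) - p)))


# Family with n = k - 1 and t = 2k - 1: equal factor sets, different periods.
for k in range(2, 9):
    n = k - 1
    x = "0" * k + "1" + "0" * (k - 2)
    y = "0" * (k - 1) + "1" + "0" * (k - 1)
    assert len(x) == len(y) == 2 * k - 1
    assert factors(x, n) == factors(y, n)
    assert (minimal_period(x), minimal_period(y)) == (k + 1, k)

# The case n = k = 3, t = 6: equal factor sets and equal periods exceeding n + 1.
x, y = "010110", "011010"
assert factors(x, 3) == factors(y, 3) == {"010", "011", "101", "110"}
assert minimal_period(x) == minimal_period(y) == 5
```

The same two helpers can be used to search exhaustively for further pairs of length $2n$ when exploring the conjecture.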